Boston: A Cold City

Introduction

Boston is a cold city. The weather in Boston affect the city in various aspects. Among these, we are mainly looking for the relationship between weather and 311 reports.

Methods

Questions we've asked

  • # of reports that contain the word snow when there is no snow on that day (and vise versa)
  • Do 311 requests come mostly from one region?
  • Is median household income correlated with number of 311 requests?
  • Within how many days of snowfall do requests for plowing happen?

Cleaning up our Data

In [1]:
# All imports go in this block
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

First, we load our weather dataset and prepare it

In [2]:
df_weather = pd.read_csv("data/Boston weather_clean.csv")
In [3]:
consecutive_snow_days = [0]
days_since_last_snow = [365]
accumulated_snow = [0]
for index, row in df_weather.iterrows():

    if row['Events'] == 'Snow' or row['Events'] == 'Both' or row['Snowfall (in)'] > 0:
        if consecutive_snow_days[-1] == 0:
            accumulated_snow.append(float(row['Snowfall (in)']))        
        else:
            accumulated_snow.append(accumulated_snow[-1] + row['Snowfall (in)'])
        
        consecutive_snow_days.append(consecutive_snow_days[-1] + 1)
        days_since_last_snow.append(0)
    else:
        accumulated_snow.append(accumulated_snow[-1])
        consecutive_snow_days.append(0)
        days_since_last_snow.append(days_since_last_snow[-1] + 1)
df_weather['consecutive_snow_days'] = consecutive_snow_days[1:]
df_weather['days_since_last_snow'] = days_since_last_snow[1:]
df_weather['accumulated_snow'] = accumulated_snow[1:]
In [4]:
# Load 311 dataset
df_311 = pd.read_csv('data/311.csv')
/Users/matt/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3049: DtypeWarning: Columns (13) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [5]:
# Split out the date column into separate Year Month and Day columns
df_311['open_dt'] = pd.to_datetime(df_311['open_dt'])
df_311['Year'] = df_311['open_dt'].apply(lambda date: date.year)
df_311['Month'] = df_311['open_dt'].apply(lambda date: date.month)
df_311['Day'] = df_311['open_dt'].apply(lambda date: date.day)
In [6]:
df_311_weather = pd.merge(df_311, df_weather, on=['Year', 'Month', 'Day'])
In [7]:
df_income = pd.read_csv('data/median income.csv', delimiter=";")
In [8]:
def convert_neighborhood(neighborhood):
    conversions = [("Allston / Brighton", "Allston/Brighton"),
                   ("Allston", "Allston/Brighton"),
                   ("Brighton", "Allston/Brighton"),
                   ("Back Bay", "Back Bay/Beacon Hill"),
                   ("Beacon Hill", "Back Bay/Beacon Hill"),
                   ("Fenway / Kenmore / Audubon Circle / Longwood", "Fenway/Kenmore")]
    for left, right in conversions:
        if neighborhood == left:
            return right
    return neighborhood
In [9]:
df_311_weather['neighborhood'] = df_311_weather['neighborhood'].apply(convert_neighborhood)
In [10]:
# Clean up dollar strings into numbers
from re import sub
from decimal import Decimal
df_income['median household income'] = df_income['median household income'].apply(lambda money: Decimal(sub(r'[^\d.]', '', money)))
In [11]:
df = pd.merge(df_311_weather, df_income, left_on="neighborhood", right_on="region", how="left")
winter_2014_2015 = (df['Year'] == 2014) & (df['Month'] > 8) | ((df['Year'] == 2015) & (df['Month'] < 5))

Look into "Both" - is Both worse than either Rain or Snow

Results

In [12]:
import matplotlib.pyplot as plt; plt.rcdefaults()
import numpy as np

df["case_title"] = df["case_title"].astype(str)

# lists all the data in which there was snowfall
def graph_snow_requests_vs_actual_requests(df):
    snowed = df["Snowfall (in)"] > 0
    snow_data = df[snowed]

    #311 report has the word snow in it
    has_snow = df["case_title"].str.contains("Snow")
    reason_snow = df[has_snow]
    
    # lists all the data in which there wasn't snowfall
    no_snow = df["Snowfall (in)"] == 0

    #trying to get dataset in which there wasn't snow but still had a report on snow
    no_snow_data = df[no_snow & has_snow]
    
    # count in which there was no snow yet reports on 311 related to snow
    count_no = no_snow_data['case_title'].count()
    yes_snow_data = df[snowed & has_snow]
    
    # count in which there was snow AND reports on 311 related to snow
    count_yes = yes_snow_data['case_title'].count()

    # 311 requests that do not have the word "snow" in it
    no_word = df["case_title"].str.contains("Snow") == False
    reason_nosnow = df[no_word]
    
    # no snow in 311 report but yes snow in terms of weather
    yesSnow_noReport = df[snowed & no_word]
    
    # count in which there was no snow in the 311 report but yes snow in terms of weather
    count_noSnowReport = yesSnow_noReport['case_title'].count()
    
    # no snow in 311 report and no snow in terms of weather
    noSnow_noReport = df[no_snow & no_word]
    
    # count in which there was no snow in 311 report and no snow in terms of weather
    count_noSnownoReport = noSnow_noReport['case_title'].count()

    # First graph
    x_pos = ('"snow" exists', '"snow" doesnt exist')
    y_pos = np.arange(len(x_pos))
    performance = [count_yes,count_noSnowReport]

    plt.bar(y_pos, performance, align='center', alpha=0.5)
    plt.xticks(y_pos, x_pos)
    plt.ylabel('Number of 311 requests')
    plt.xlabel('Whether the word "snow" exists within the 311 report title')
    plt.title('Having the word "snow" in 311 requests when there WAS snowfall')

    plt.show()
    
    # Second graph
    x_pos = ('"snow" exists', '"snow" doesnt exist')
    y_pos = np.arange(len(x_pos))
    performance = [count_no,count_noSnownoReport]

    plt.bar(y_pos, performance, align='center', alpha=0.5)
    plt.xticks(y_pos, x_pos)
    plt.ylabel('Number of 311 requests')
    plt.xlabel('Whether the word "snow" exists within the 311 report title')
    plt.title('Having the word "snow" in 311 requests when there WAS NO snowfall')

    plt.show()
In [13]:
graph_snow_requests_vs_actual_requests(df)
In [14]:
graph_snow_requests_vs_actual_requests(df[winter_2014_2015])

Winter 14-15:

- More snow plow requests than other requests occur on days that have > 0.0 snowfall in inches
- When there is no snow fall, most reports do not have the word snow

Overall:

- There are more non-snow plow requests on days that have > 0.0 snowfall
- Almost no requests for snow plowing on days that have no snow

311 Reports by Region

Here, we correlate the frequency of 311 reports with each region in Boston by counting the occurrences of each neighborhood in the "neighborhood" column of the 311 dataset.

In [15]:
# Data manipulation: get counts of reports by neighborhood
value_counts = df["neighborhood"].value_counts()

x_pos = np.arange(len(value_counts.keys()))
plt.bar(x_pos, list(value_counts[key] for key in value_counts.keys()), align='center',
        color='green', ecolor='black')
plt.xticks(x_pos, value_counts.keys(), rotation='vertical')
plt.xlabel("neighborhood")
plt.ylabel("frequency")
plt.title("region correlated with frequency of 311 reports")
plt.show()

The data is sorted from the most to the least. From the image, we can tell that Dorchester has the most 311 reports, which is a lot more than the other regions. We looked up the population of each region and found that Dorchester has the largest population following by Roxbury and Brighton. Thus we guess that the frequency of 311 reports of each region has a direct relationship with the population of that region.

Source: Boston population by neighborhood

311 Reports by Median Household Income

After looking at which regions have the most 311 requests, we looked at the median household income of each of those regions to try to find a correlation between median household income and frequency of 311 requests.

In [16]:
# Data manipulation: get counts of reports by region
value_counts = df["median household income"].value_counts()

# Plot how many of our reports come from low vs high income areas
plt.scatter(list(map(lambda x: int(x), value_counts.keys())), list(value_counts[key] for key in value_counts.keys()))
plt.xlabel("income ($)")
plt.ylabel("frequency")
plt.title("income correlated with number of 311 reports")
plt.show()

We relate the median income of each region and the frequency of 311 reports into a dot plot. Looking through the median income, most of the corresponding frequency data are gathered in the rage of 2500 to 10000, regardless of the income. We thought there might be more reports in the relatively poor regions, however we can’t draw any pattern from the dot plot. Thus the income and the frequency of 311 reports don’t have a direct relationship.

Extreme Weather

Next, we investigate a few types of extreme weather, and how these types of weather may affect the kinds of 311 requests that come in. These are windy, snowy, chilly. It is interesting to see which sorts of requests occur more for certain types of weather than others.

One visually effective way to observe trends in word choice is by generating word clouds. Below, we define which requests fit our criteria for each weather type, and then we generate wordclouds from the request text for those requests.

In [17]:
## WINDY ##
windy = df['Avg Wind (mph)'] > 30

## SNOWY ##
snowy = df['days_since_last_snow'] < 2

## CHILLY ##
celcius = df['Low Temp (F)'].apply(lambda x: (x - 32) * (5.0/9.0))
chilly = celcius < (-20)
In [18]:
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
def make_wordcloud_from_text(text):
    wordcloud = WordCloud(
        width = 3000,
        height = 2000,
        background_color = 'white',
        stopwords = STOPWORDS).generate(text)
    fig = plt.figure(
        figsize = (20, 15),
        facecolor = 'white',
        edgecolor = 'white')
    plt.imshow(wordcloud, interpolation = 'bilinear')
    plt.axis('off')
    plt.tight_layout(pad=0)
    plt.show()
    
def make_wordcloud(df):
    text_cols = ['case_title', 'reason', 'type']
    make_wordcloud_from_text(" ".join(df[col_name].str.cat(sep=" ") for col_name in text_cols))

Snowy

The first type of extreme weather that we showcase is the snowy case. We define this to be days on which it snowed, as well as the two days immediately following a snow day. The biggest words are Snow Plowing Request, which fits our expectations. There are also requests to repair potholes, and reports of missed trash pickups. There are many requests for Scheduled Bulk Item Pickup, which baffles us a little bit - perhaps these are reports of missed pickups.

An interesting snow-related request is "Parking Enforcement" - I wonder how many of these are people who had marked their spot with lawn chairs. More likely, since snow makes many previous parking spaces unviable, people are more likely to park illegally and thus trigger more parking enforcement-related 311 calls.

In [19]:
make_wordcloud(df[snowy])

Windy

In the wordcloud below, observe the words that come up most often for windy days (days with an average windspeed of 30 mph or higher). Tree related requests, as well as Downed Wire reports are very common. Street lights, traffic lights, signal repair, all seem to show that Boston's traffic flow control infrastructure is vulnerable to high winds. There are also many requests for street cleaning

In [20]:
make_wordcloud(df[windy])

Chilly

Snow is only one of the challenges Bostonians face in the winter. The other one is just sheer cold. What kinds of calls do Bostonians make in subfreezing temperatures?

The biggest words by far come from the request "Heat Excessive Insufficient" which seems like a catch-all type for heating related issues.

In [21]:
make_wordcloud(df[chilly])

Conclusion

  • Not sure how to normalize our data
  • Snow has a delayed effect - requests related to snow can happen the day after it snows

Correlate 311 with demographics (income, race, etc)

correlate region with frequency of 311 reports correlate income with frequency of 311 reports

correlate weather with type of 311 report correlate weather with number of 311 reports on that day

  1. Make visualizations (individually) using our merged dataset
  2. Look at those visualizations and come up with more questions to ask about the data
  1. Merge demographics dataset into current dataset
In [22]:
pd.set_option('display.max_columns', None)
plt.hist(df['consecutive_snow_days'][df['type'] == 'Request for Snow Plowing'], bins=20)

plt.show()
In [23]:
# Data manipulation: Put snowfall into bins
plt.hist(df['High Wind Gust (mph)'], bins=50)

plt.show()
In [24]:
# What is the average days_since_last_snow for snow removal requests
# 1. filter the table for only entries about snow removal
plow_requests = df[df['case_title'].str.contains("Snow", na=False)]
# 2. for those entries, compute the average of days_since_last_snow
select_by = (plow_requests['Year'] == 2014) & (plow_requests['Month'] > 8) | ((plow_requests['Year'] == 2015) & (plow_requests['Month'] < 5))
print(plow_requests[select_by]['days_since_last_snow'].mean())
print(plow_requests['days_since_last_snow'].mean())
0.281981759921124
0.41537880548454564
In [25]:
# For longer consecutive snow days are there more requests?
In [26]:
# df[has_snow].filter(['Year', 'Month', 'Day', 'days_since_last_snow']).groupby(by=['Year', 'Month', 'Day']).groupby(by='days_since_last_snow').count()